clusterProfiler

Tim Vigers

1/25/23

Functional enrichment

  • Methods for interpreting results of various high-throughput omics studies.
  • Most omics analyses produce a list of differentially expressed genes (or proteins, metabolites, etc. that can be mapped to a gene).
    • For example, comparing protein levels between two groups.
  • How do we interpret a list of differentially expressed genes in a biological context?

Functional enrichment terminology

  • Gene set: an unordered collection of functionally related genes (a pathway) [1]
  • Gene ontology (GO): a formal representation of three aspects of biological knowledge [2]
    • Molecular function
      • E.g., “catalysis” or “transport”
    • Cellular component
      • Either cellular compartments (e.g. “mitochondrion”) or stable macromolecular complexes (e.g. “ribosome”)

Functional enrichment terminology

  • Gene ontology (continued):
    • Biological processes
      • Larger processes made up of multiple molecular functions.
      • E.g., “signal transduction” or “glucose transmembrane transport”
      • Not necessarily equivalent to a pathway

GO graph

  • Organized as a directed acyclic graph (DAG)
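One practical consequence of the DAG structure is that a term can have more than one parent, and a gene annotated to a term is implicitly annotated to all of that term's ancestors. The sketch below illustrates this in base R with a made-up four-term ontology; the term names and the `parents` and `ancestors` objects are hypothetical and are not taken from GO itself.

```r
# Toy ontology (made-up terms): each term maps to its direct parents.
parents <- list(
  root               = character(0),
  transport          = "root",
  membrane_process   = "root",
  membrane_transport = c("transport", "membrane_process")  # two parents
)

# Collect all ancestors of a term by walking up the graph recursively.
# This terminates because the graph has no cycles.
ancestors <- function(term, parents) {
  direct <- parents[[term]]
  unique(c(direct, unlist(lapply(direct, ancestors, parents = parents))))
}

ancestors("membrane_transport", parents)
```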

KEGG: Kyoto Encyclopedia of Genes and Genomes

  • A “manually curated database resource integrating various biological objects categorized into systems, genomic, chemical and health information” (Kanehisa et al., 2023).
  • Sixteen databases in four broad categories: systems, genomic, chemical, and health information.

KEGG Gluconeogenesis Pathway

Broad categories of KEGG pathways

  1. Metabolism
  2. Genetic Information Processing
  3. Environmental Information Processing
  4. Cellular Processes
  5. Organismal Systems
  6. Human Diseases
  7. Drug Development

Other gene sets

  • GO and KEGG are the most frequently used [1]
  • Can use any pathway database for these analyses
  • Alternatives include:
    • Disease Ontology (DO)
    • Disease Gene Network (DisGeNET)
    • WikiPathways
    • Molecular Signatures Database (MSigDB)
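Whatever the source, a gene set collection can be reduced to a two-column table of term and gene. As a minimal sketch, the base R code below parses two made-up lines in GMT format (tab-separated: set name, description, then member genes), which is a common distribution format for gene sets. The set names, gene names, and the `term2gene` object are hypothetical.

```r
# Two toy GMT records: name <tab> description <tab> gene1 <tab> gene2 ...
gmt_lines <- c(
  "toy_set_A\tdescription A\tg1\tg2\tg3",
  "toy_set_B\tdescription B\tg2\tg4"
)

fields <- strsplit(gmt_lines, "\t", fixed = TRUE)

# One row per (term, gene) pair; the first two fields are name and description.
term2gene <- do.call(rbind, lapply(fields, function(x) {
  data.frame(term = x[1], gene = x[-(1:2)], stringsAsFactors = FALSE)
}))

term2gene
```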

Over representation analysis (ORA)

  • Takes a list of differentially expressed genes and tests whether genes from various pathways are present in the list more often than expected.
  • Usually a hypergeometric test (Boyle et al., 2004)

Over representation analysis (ORA)

\[ p = 1 - \frac{\sum_{i=0}^{k-1}{M\choose i}{N-M \choose n-i}}{N \choose n} \]

  • \(N\) is the number of genes in the background distribution (usually all annotated genes)
  • \(n\) is the number of genes of interest
  • \(M\) is the number of genes annotated to the particular gene set (pathway) \(S\)
  • \(k\) is the number of genes of interest that are annotated to \(S\)
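To connect the formula to code, here is a minimal sketch that evaluates the sum directly and compares it with R's built-in `phyper()`. The function name `ora_pvalue` and the toy numbers are assumptions made for illustration. Binomial coefficients are computed on the log scale with `lchoose()` because `choose()` overflows for realistic gene counts.

```r
# Direct evaluation of p = 1 - sum_{i=0}^{k-1} C(M,i) C(N-M,n-i) / C(N,n).
# N: background size, n: genes of interest,
# M: genes in the set, k: genes of interest that are in the set.
ora_pvalue <- function(N, n, M, k) {
  i <- seq_len(k) - 1   # i = 0, ..., k - 1 (empty when k = 0)
  log_terms <- lchoose(M, i) + lchoose(N - M, n - i) - lchoose(N, n)
  1 - sum(exp(log_terms))
}

# Toy numbers (hypothetical): 20 background genes, 5 of interest,
# 8 in the set, 3 of the genes of interest in the set.
ora_pvalue(N = 20, n = 5, M = 8, k = 3)

# Same quantity from the hypergeometric distribution function.
# Note that phyper() names its arguments differently: m = M, n = N - M, k = n.
phyper(q = 3 - 1, m = 8, n = 20 - 8, k = 5, lower.tail = FALSE)
```

For very small p-values, subtracting from 1 loses precision, so `phyper(..., lower.tail = FALSE)` is the safer choice in practice.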

ORA example

  • A background of 10,000 genes, of which 260 are categorized as “axon guidance.”
  • We find that 1000 genes are differentially expressed, and 50 of those are categorized as “axon guidance.”
  • Is this statistically significant over representation?
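# Hypergeometric test: P(X >= 50), so evaluate the upper tail at q = 50 - 1
phyper(q = 50 - 1, m = 260, n = 10000 - 260, k = 1000, lower.tail = FALSE)
[1] 3.825066e-06
# Equivalent one-sided Fisher's exact test on the 2x2 table
# (rows: in / not in "axon guidance"; columns: DE / not DE)
m <- matrix(c(50, 1000 - 50, 260 - 50, 10000 - 1000 - 260 + 50), nrow = 2)
fisher.test(m, alternative = "greater")$p.value
[1] 3.825066e-06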

ORA problems

  • Does not automatically account for the direction of differential expression
    • Can look at down-regulated and up-regulated genes separately.
  • Also does not take effect size into account
    • Will miss small but coordinated changes across many genes
      • Such coordinated changes can be more biologically relevant than a large change in a single gene.
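Handling direction separately can be sketched as follows: split the significant genes by the sign of the fold change and pass each list to ORA on its own. The data frame `de`, its column names, and the 0.05 cutoff are hypothetical values chosen for illustration.

```r
# Toy differential expression results (made-up values).
de <- data.frame(
  gene   = paste0("g", 1:8),
  log2fc = c(1.8, -2.1, 0.2, 1.1, -0.1, -1.5, 2.4, 0.05),
  padj   = c(0.001, 0.003, 0.60, 0.02, 0.90, 0.04, 0.0005, 0.75),
  stringsAsFactors = FALSE
)

# Keep significant genes, then split by direction of change.
sig        <- de[de$padj < 0.05, ]
up_genes   <- sig$gene[sig$log2fc > 0]
down_genes <- sig$gene[sig$log2fc < 0]

up_genes
down_genes
```

Each of `up_genes` and `down_genes` would then be tested against the same background.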

Gene set enrichment analysis (GSEA)

  • The goal is to test whether members of a gene set \(S\) are distributed randomly throughout a ranked gene list \(L\) or are concentrated at the top or bottom (Subramanian et al., 2005).
  • An enrichment score (ES) is calculated by:
    1. Walking down the gene list \(L\) (usually ranked by effect size/correlation with the phenotype).
    2. Increasing a running sum when a gene is in \(S\) and decreasing it when the gene is not in \(S\).

Gene set enrichment analysis (GSEA)

  1. The final ES is the maximum deviation from 0, and corresponds to a weighted Kolmogorov–Smirnov-like statistic (Subramanian et al., 2005).
  2. The statistical significance of the ES is estimated using an empirical phenotype-based permutation test.
     • Shuffling the phenotype preserves gene-gene correlations, and is better than shuffling gene labels.
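The running-sum ES and the phenotype permutation test can be sketched in base R as below. This is a simplified illustration under stated assumptions, not the reference GSEA implementation: the ranking statistic is a plain difference in group means, the p-value is one-sided for a positive ES, and there is no ES normalization or multiple-testing correction. The simulated data and the names `enrichment_score`, `rank_stat`, and `toy_set` are all hypothetical.

```r
# Running-sum enrichment score for one gene set.
# stats: named numeric vector of per-gene ranking statistics.
# gene_set: character vector of gene names in S.
# p: weight exponent (p = 0 gives the unweighted Kolmogorov-Smirnov form).
enrichment_score <- function(stats, gene_set, p = 1) {
  stats  <- sort(stats, decreasing = TRUE)        # walk from the top of L
  in_set <- names(stats) %in% gene_set
  stopifnot(any(in_set), any(!in_set))
  hit_weight <- abs(stats)^p * in_set
  step_up    <- hit_weight / sum(hit_weight)      # increments for genes in S
  step_down  <- (!in_set) / sum(!in_set)          # decrements for genes not in S
  running    <- cumsum(step_up - step_down)
  unname(running[which.max(abs(running))])        # maximum deviation from 0
}

# Simulated expression data: 200 genes, 10 samples per group.
set.seed(1)
n_genes <- 200
n_per_group <- 10
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)), NULL))
phenotype <- rep(c("control", "case"), each = n_per_group)

# Shift a toy gene set upward in cases so there is something to detect.
toy_set <- paste0("g", 1:20)
expr[toy_set, phenotype == "case"] <- expr[toy_set, phenotype == "case"] + 1

# Ranking statistic (an assumption for this sketch): difference in group means.
rank_stat <- function(expr, phenotype) {
  rowMeans(expr[, phenotype == "case"]) - rowMeans(expr[, phenotype == "control"])
}

observed_es <- enrichment_score(rank_stat(expr, phenotype), toy_set)

# Null distribution: shuffle phenotype labels, re-rank, recompute the ES.
null_es <- replicate(
  500,
  enrichment_score(rank_stat(expr, sample(phenotype)), toy_set)
)

# One-sided empirical p-value, with +1 so it is never exactly 0.
p_value <- (sum(null_es >= observed_es) + 1) / (length(null_es) + 1)
```

Because each permutation reshuffles sample labels and leaves every gene's expression profile intact, the correlation structure among genes is the same in the null distribution as in the observed data.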

clusterProfiler

  • Essentially, a set of wrapper functions that simplify functional enrichment analyses.
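As a hedged sketch of what such a wrapper call looks like, the example below runs ORA with `clusterProfiler::enricher()` on custom gene sets supplied as a two-column term/gene table. All gene names, set names, and sizes are made up, and the call is guarded so the code still runs when the package is not installed.

```r
# Toy inputs (hypothetical): a background, two gene sets, and genes of interest.
universe <- paste0("g", 1:200)
term2gene <- rbind(
  data.frame(term = "toy_pathway_A", gene = paste0("g", 1:30),
             stringsAsFactors = FALSE),
  data.frame(term = "toy_pathway_B", gene = paste0("g", 101:140),
             stringsAsFactors = FALSE)
)
genes_of_interest <- paste0("g", c(1:20, 150:155))

# enricher() performs ORA against user-supplied gene sets.
if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  ora <- clusterProfiler::enricher(
    gene      = genes_of_interest,
    universe  = universe,
    TERM2GENE = term2gene
  )
  print(head(as.data.frame(ora)))
}
```

The package also provides `enrichGO()` and `enrichKEGG()` for ORA against GO and KEGG, and `GSEA()`, `gseGO()`, and `gseKEGG()` for gene set enrichment analysis on a ranked gene list.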

References

1. Yu G. Biomedical Knowledge Mining Using GOSemSim and clusterProfiler. Accessed January 25, 2023. https://yulab-smu.top/biomedical-knowledge-mining-book/index.html
2. Gene Ontology overview. Accessed January 25, 2023. http://geneontology.org/docs/ontology-documentation/